White Wine Quality by Diogo Cosin Ayres de Oliveira

This report explores a data set containing information about 4898 white wines provided by Cortez et al. (2009). Each observation is described by its chemical properties and a quality score assigned by experts.

First, some numbers about the data set.

## [1] 4898   13

The data set contains 4898 observations with 13 attributes, including one for the index and another for the quality grade.

Below are the attribute names and their summary statistics.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Univariate Plots Section

Wine Quality

Each wine is graded on a scale from 0 (very bad) to 10 (very excellent) by at least 3 wine experts.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

The quality histogram is approximately bell-shaped. Most wines are rated around 6; the maximum observed is 9 and the minimum is 3. No wine received a perfect 10, and only a few received a 9. Throughout this EDA report, let’s try to understand which factors produce better wines according to the experts’ opinions.

Acidity Attributes

Now let’s visualize the acidity-related chemical properties of the wines in our data set.

Below are the summary statistics of these three attributes.

summary(df_ww$fixed.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200
summary(df_ww$volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000
summary(df_ww$citric.acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The acidity attributes are roughly bell-shaped around their medians, but each has a long right tail of extreme values that places the maximum far from the median. For fixed acidity, the maximum observed is 14.2 while the median is 6.8 and the 3rd quartile is 7.3. The same holds for volatile acidity (median 0.26, 3rd quartile 0.32, maximum 1.1) and citric acid (median 0.32, 3rd quartile 0.39, maximum 1.66). Since most wines have values close to the median, could the low-frequency bins correspond to higher-quality wines? The multivariate analysis may reveal some relationship between these attributes and wine quality.
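One way to quantify how far these maxima sit from the bulk of the data is Tukey's rule, which flags values above Q3 + 1.5 × IQR as potential outliers. The sketch below applies it to the quartiles reported in the summaries above; the data frame `acid_stats` and its column names are illustrative and not part of the original analysis.

```r
# Quartiles and maxima copied from the summary() output above.
acid_stats <- data.frame(
  attribute = c("fixed.acidity", "volatile.acidity", "citric.acid"),
  q1  = c(6.30, 0.21, 0.27),
  q3  = c(7.30, 0.32, 0.39),
  max = c(14.20, 1.10, 1.66)
)

# Tukey's upper fence: values above Q3 + 1.5 * IQR are potential outliers.
acid_stats$iqr <- acid_stats$q3 - acid_stats$q1
acid_stats$upper_fence <- acid_stats$q3 + 1.5 * acid_stats$iqr
acid_stats$max_beyond_fence <- acid_stats$max > acid_stats$upper_fence

print(acid_stats)
```

For all three attributes the maximum lies well beyond the fence (8.8, 0.485 and 0.57 respectively), which is consistent with the long right tails described above.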

Sulfur and Sulphate Attributes

Continuing the univariate exploration, let’s plot the histograms of the sulfur and sulphate attributes.

Summary statistics for the sulfur and sulphate attributes:

summary(df_ww$free.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00
summary(df_ww$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0
summary(df_ww$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Again, as with the acidity attributes, most wines share similar sulfur and sulphate characteristics, since the observations cluster around the median values. Therefore, as the quality distribution is also approximately bell-shaped, we might expect wines with atypical attribute values to be reviewed better than those near the median; we will test this in the multivariate analysis. Regarding the distributions, free sulfur dioxide has a long right tail: the maximum is 289 while the median is 34 and the 3rd quartile is 46. The left side shows no such pattern, as the 1st quartile is 23 and the minimum is 2.

pH and Alcohol

Let’s plot the related pH and alcohol attributes.

summary(df_ww$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820
summary(df_ww$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The plots show that pH is approximately normally distributed, with a median of 3.18, a mean of 3.188, a 1st quartile of 3.09 and a 3rd quartile of 3.28. Alcohol, however, is positively skewed, with a median of 10.4 and a mean of 10.51. It is interesting that the two distributions differ in shape, since I expected alcohol content to be closely related to pH. The differing shapes hint that the two may not be strongly related, although histograms alone cannot confirm this; the correlation matrix in the next section gives a direct check.
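Histogram shapes on their own say little about how two attributes relate; a correlation coefficient is the more direct check. The following minimal sketch uses synthetic data (not the wine data) to show two variables with clearly different shapes that are nonetheless perfectly rank-correlated. The variable names are hypothetical.

```r
set.seed(42)                      # reproducible synthetic sample
x <- rnorm(500)                   # symmetric, bell-shaped
y <- exp(x)                       # right-skewed transform of x

# Different shapes: the skewed variable has its mean above its median.
skewed <- mean(y) > median(y)

# Yet y is a monotone function of x, so the rank correlation is perfect.
rho <- cor(x, y, method = "spearman")
print(rho)

# On the wine data the equivalent check would be, for example:
# cor(df_ww$pH, df_ww$alcohol)
```

The same call on `pH` and `alcohol` would show directly whether the expected relationship exists, independently of how each histogram looks.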

Residual Sugar, Chlorides and Density

Finally, let’s explore the distributions of the remaining attributes: residual sugar, chlorides and density.

summary(df_ww$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
summary(df_ww$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
summary(df_ww$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

All three attributes are approximately normally distributed, with the exception of residual sugar. Residual sugar is strongly positively skewed, which pulls the mean (6.391) above the median (5.2). To see how residual sugar behaves in other regions, let’s redraw the plot with a log scale on the x-axis.

With a log10 x scale, a bimodal distribution becomes visible: most wines have residual sugar either between 1 and 2 or between 7 and 15, showing that the wines fall into a lower-sugar group and a higher-sugar group.
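The effect of the log transform can be reproduced without a plot by binning the log10 values. This is a minimal sketch on a synthetic two-group sample that merely stands in for residual sugar; the group sizes and parameters are assumptions chosen for illustration.

```r
set.seed(1)
# Synthetic stand-in for residual sugar: a drier group and a sweeter group.
sugar <- c(rlnorm(600, meanlog = log(1.5), sdlog = 0.25),
           rlnorm(400, meanlog = log(10),  sdlog = 0.30))

# Histogram of the log10-transformed values (plot = FALSE returns the bins only).
h <- hist(log10(sugar), breaks = 30, plot = FALSE)

# Bin midpoints converted back to the original units, with their counts.
bins <- data.frame(sugar = round(10^h$mids, 2), count = h$counts)
print(bins)
```

With ggplot2, adding `scale_x_log10()` to a histogram applies the same transform on the x-axis.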

Bivariate and Multivariate Plots Section

Correlation Matrix

Before analysing the bivariate plots, let’s compute the correlation matrix using Pearson’s method. It gives initial correlation coefficients between attributes and lets us focus the multivariate analysis on the most correlated ones.

The matrix shows strong correlations between some attributes. For instance, density and alcohol have a linear correlation of -0.78, and density and residual sugar of 0.84. Quality and alcohol have a more moderate linear correlation of 0.44.
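A Pearson correlation matrix of this kind can be computed with base R's `cor()`. The sketch below runs on a synthetic data frame named `wines` (a hypothetical stand-in; the coefficients used to generate `density` are made up so that the signs resemble those reported above).

```r
set.seed(7)
n <- 300
# Synthetic stand-in data; the generating coefficients are illustrative only.
alcohol        <- runif(n, 8, 14)
residual.sugar <- runif(n, 0.6, 20)
density <- 0.99 + 0.0004 * residual.sugar - 0.001 * alcohol +
  rnorm(n, sd = 0.0003)
wines <- data.frame(alcohol, residual.sugar, density)

# Pearson correlation matrix, rounded to two decimals.
corr <- round(cor(wines, method = "pearson"), 2)
print(corr)
```

On the wine data, `cor()` needs numeric columns only, so it would have to be called before `quality` is converted to a factor, or on a numeric subset of the data frame.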

Creating Sulphates and Residual Sugar Buckets

To help with the multivariate exploration, let’s bucket the residual sugar and sulphates attributes so that we can color other scatter plots using these buckets as a reference.

df_ww$quality <- factor(df_ww$quality)
df_ww$sulphates.bucket <- cut(df_ww$sulphates, breaks=c(seq(0.2,1.1,0.2)))
df_ww$residual.sugar.bucket <- cut(df_ww$residual.sugar, breaks=c(seq(0,14,3)))
str(df_ww$sulphates.bucket)
##  Factor w/ 4 levels "(0.2,0.4]","(0.4,0.6]",..: 2 2 2 1 1 2 2 2 2 2 ...
str(df_ww$residual.sugar.bucket)
##  Factor w/ 4 levels "(0,3]","(3,6]",..: NA 1 3 3 3 3 3 NA 1 1 ...
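The `NA` entries in the residual sugar buckets come from values outside the breaks: `seq(0, 14, 3)` stops at 12, so any wine with more than 12 of residual sugar gets no bucket (likewise, sulphates above 1.0 fall outside `seq(0.2, 1.1, 0.2)`). The sketch below shows this on four values taken from the summary above and, as a hypothetical alternative that the report does not use, adds an open-ended top bucket.

```r
# Minimum, median, 3rd quartile and maximum of residual sugar from the summary.
sugar <- c(0.6, 5.2, 9.9, 65.8)

# Breaks used in the report: 0, 3, 6, 9, 12. Anything above 12 falls outside.
report_breaks <- seq(0, 14, 3)
bucket <- cut(sugar, breaks = report_breaks)
print(bucket)          # the maximum, 65.8, becomes NA

# Hypothetical alternative: add an open-ended top bucket so no wine is dropped.
open_breaks <- c(report_breaks, Inf)
bucket_open <- cut(sugar, breaks = open_breaks)
print(bucket_open)     # 65.8 now falls in "(12,Inf]"
```

Wines with `NA` buckets are dropped or shown in grey when the bucket is used as a color aesthetic, so the choice of breaks affects which observations appear in the colored scatter plots.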

Density Exploration

Density is expected to have a strong linear relationship with residual sugar and alcohol, given that both change the density of the wine relative to water. Furthermore, according to the correlation matrix, density has a linear correlation coefficient of 0.84 with residual sugar and -0.78 with alcohol. Let’s produce scatter plots to visualize these relationships.

As expected, density shows a strong linear correlation with residual sugar and alcohol. Since dissolved sugar raises the density of a liquid and alcohol, being lighter than water, lowers it, it is reasonable to interpret this relationship as causal rather than merely correlational.

pH and Fixed Acidity Exploration

The matrix also shows a moderate correlation between pH and fixed acidity. Again, chemistry supports this relationship, since acidity influences the pH of a substance. Let’s visualize it.

The plot is consistent with the linear correlation of -0.43, and most wines have a pH between 3.0 and 3.3. However, the relationship is not as strong as expected. Other attributes besides fixed acidity may be influencing the pH of the wines.

First, let’s see through a scatter plot how sulphates affect the pH vs fixed acidity distribution.

It is hard to detect any pattern in how sulphates alter the pH of a wine. The previous plot shows no clear tendency.

Proceeding with the pH vs fixed acidity exploration, let’s now color our scatter plot with the residual sugar attribute as reference.

In the previous plot, wines with more residual sugar tend to have a lower pH (more acidic). However, this tendency is weak, and there is no clear pattern in how residual sugar influences pH. This suggests that, in addition to fixed acidity, residual sugar may be affecting the pH of a wine, despite not having a strong linear correlation with it.

Quality Exploration

Now let’s explore how wine quality relates to some attributes. First, as the correlation matrix showed, quality has its strongest linear correlation with alcohol. Let’s visualize this relationship to confirm it.

Indeed, wines with more alcohol tend to be reviewed better by the experts, which is consistent with the linear correlation coefficient of 0.44. There is also some discrepancy between mean and median for wines graded 5 and 8, due to the outliers found.
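Because `quality` was converted to a factor in the bucketing step, numeric summaries such as a mean grade or a correlation coefficient need it converted back first. The sketch below, on a small made-up vector, shows the usual pitfall: `as.numeric()` applied directly to a factor returns the internal level codes rather than the grades.

```r
# Made-up grades stored as a factor, as df_ww$quality is after the bucketing step.
quality <- factor(c(3, 5, 6, 6, 9))

# as.numeric() on a factor returns the internal level codes, not the grades.
codes <- as.numeric(quality)
print(codes)

# Going through the character labels recovers the original grades.
grades <- as.numeric(as.character(quality))
print(grades)
```

On the wine data this would read `as.numeric(as.character(df_ww$quality))` before a call such as `cor()` or `mean()`.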

Carrying on with the quality exploration, let’s visualize its relationship with density. We expect a moderate negative relationship, since the linear correlation between them (from the correlation matrix) is -0.31 and density also correlates with alcohol, as shown in the previous bivariate plots.

The median density clearly tends to decrease as quality increases. This is consistent with the linear correlation coefficient of -0.31 between them and with density’s correlation with alcohol.

Now, considering that chlorides also showed a relatively strong linear correlation with quality compared to other attributes, let’s plot this relationship to check whether it holds.

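library(ggplot2)

# Chlorides by quality grade: jittered points, box plots and the group mean (red star).
ggplot(df_ww, aes(x = quality, y = chlorides)) +
  geom_jitter(alpha = .2) +
  geom_boxplot(alpha = .5, color = 'blue') +
  stat_summary(fun = "mean",   # named `fun.y` in ggplot2 versions before 3.3.0
               geom = "point",
               color = "red",
               shape = 8,
               size = 2.5) +
  # ylim() drops the points above the 99th percentile, hence the warnings below
  ylim(0, quantile(df_ww$chlorides, 0.99)) +
  ggtitle('Chlorides vs Quality') +
  xlab('Quality') +
  ylab('Chlorides') +
  theme(plot.title = element_text(hjust = 0.5))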
## Warning: Removed 48 rows containing non-finite values (stat_boxplot).
## Warning: Removed 48 rows containing non-finite values (stat_summary).
## Warning: Removed 49 rows containing missing values (geom_point).

Again, the visualization supports that chlorides is relatively strongly correlated with quality (compared to other attributes), with a linear correlation coefficient of -0.21. The plot also shows that, because of high outliers, the means are higher than the medians.
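The gap between mean and median within each quality grade can be tabulated directly. The sketch below uses a small made-up sample rather than the wine data; on the real data frame the same `split()` and `sapply()` calls would take `df_ww$chlorides` and `df_ww$quality`.

```r
# Made-up chlorides values for two quality grades, each with one high outlier.
chlorides <- c(0.040, 0.045, 0.050, 0.200, 0.030, 0.035, 0.036, 0.120)
quality   <- factor(c(5, 5, 5, 5, 7, 7, 7, 7))

# Mean and median of chlorides within each quality grade.
groups <- split(chlorides, quality)
by_grade <- data.frame(
  mean   = sapply(groups, mean),
  median = sapply(groups, median)
)
print(by_grade)
```

The high outlier in each group pulls the mean above the median, which is the pattern the box plot shows.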

This plot shows the strong relationship between alcohol, density and quality. It is easy to see that wine quality increases as alcohol volume increases and density decreases. However, there are some outliers of good quality with low alcohol and high density. Other factors such as residual sugar and fixed acidity may also influence wine quality, even though they do not show a strong linear correlation with it.

So, to explore which other factors make a high-quality wine, let’s produce more plots, replacing density with each of them.

Those outliers are still there, with high residual sugar and low alcohol. So residual sugar by itself doesn’t explain them.

Let’s now look at volatile acidity.

This plot shows an interesting slight tendency: among wines with lower alcohol, those with low volatile acidity have better quality.

It is really difficult to find a pattern in the influence of total sulfur dioxide on wine quality. As the previous multivariate scatter plots showed, alcohol correlates with wine quality, but this plot does not clearly show how total sulfur dioxide acts on wine quality.

Final Plots and Summary

Plot One

Description One

This plot shows that wine quality correlates with alcohol content: quality tends to increase as alcohol increases. Moreover, according to the correlation matrix produced earlier in this report, alcohol is the attribute with the most substantial correlation with wine quality.

Plot Two

Description Two

The correlation matrix also identified very strong correlations between density and both alcohol and residual sugar. This is expected, given that alcohol and residual sugar change the density of the wine. The plots confirm the correlation: density tends to increase as residual sugar increases and to decrease as alcohol increases.

Plot Three

Description Three

Again we see the relevant correlation between alcohol and quality. However, this plot also shows a slight tendency for quality to increase as volatile acidity decreases at lower alcohol levels. This suggests that, besides alcohol, other attributes contribute to wine quality and, in combination, may determine a high-quality wine.


Reflection

The data set provided by Cortez et al. (2009) contains attributes of 4,898 white wines. With this data it was possible to explore how these chemical attributes correlate with each other and how they relate to the quality of a white wine. Each wine was graded by experts on a scale from 0 to 10.

Initially, the exploration examined the attribute distributions with the aid of univariate histograms. Most attributes were approximately normally distributed, with the exception of residual sugar. After that, the correlations between attributes were analyzed through the correlation matrix and bivariate scatter plots. Some chemical concepts were confirmed by very strong correlations, such as density depending on residual sugar and alcohol. Still in the bivariate analysis, it was surprising to discover that alcohol had the highest correlation with wine quality.

It was really hard to detect patterns in the correlation between wine quality and other attributes, but thanks to multivariate scatter plots it was possible to notice that other attributes also act on wine quality. Volatile acidity plays a part, given that low volatile acidity tends to go with higher quality. However, other multivariate plots failed to show strong patterns or correlations, and it was not possible to draw insights or conclusions from them.

For future work, this data set could be enriched with more wine attributes and observations, so that a quality prediction model could be built using machine learning techniques.

References

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.